Split-brain (computing)
   HOME

TheInfoList



OR:

Split-brain is a computer term, based on an analogy with the medical
Split-brain Split-brain or callosal syndrome is a type of disconnection syndrome when the corpus callosum connecting the two hemispheres of the brain is severed to some degree. It is an association of symptoms produced by disruption of, or interference wit ...
syndrome. It indicates data or availability inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a
network design Network planning and design is an iterative process, encompassing topological design, network-synthesis, and network-realization, and is aimed at ensuring that a new telecommunications network or service meets the needs of the subscriber and op ...
, or a failure condition based on servers not communicating and synchronizing their data to each other. This last case is also commonly referred to as a
network partition A network partition is a division of a computer network into relatively independent subnets, either by design, to optimize them separately, or due to the failure of network devices. Distributed software must be designed to be partition-tolerant, t ...
. Although the term split-brain typically refers to an error state, split-brain DNS (or split-horizon DNS) is sometimes used to describe a deliberate situation where internal and external DNS services for a corporate network are not communicating, so that separate DNS name spaces are to be administered for external computers and for internal ones. This requires a double administration, and if there is domain overlap in the computer names, there is a risk that the same
fully qualified domain name A fully qualified domain name (FQDN), sometimes also referred to as an ''absolute domain name'', is a domain name that specifies its exact location in the tree hierarchy of the Domain Name System (DNS). It specifies all domain levels, including th ...
(FQDN), may ambiguously occur in both name spaces referring to different computer IP addresses.
High-availability cluster High-availability clusters (also known as HA clusters, fail-over clusters) are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability softwa ...
s usually use a
heartbeat private network In computer science, a heartbeat is a periodic signal generated by hardware or software to indicate normal operation or to synchronize other parts of a computer system. Heartbeat mechanism is one of the common techniques in mission critical syste ...
connection which is used to monitor the health and status of each node in the cluster. For example, the split-brain syndrome may occur when all of the private links go down simultaneously, but the cluster nodes are still running, each one believing they are the only one running. The data sets of each cluster may then randomly serve clients by their own "idiosyncratic" data set updates, without any coordination with the other data sets. This may lead to
data corruption In the pursuit of knowledge, data (; ) is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted. ...
or other data inconsistencies that might require operator intervention and cleanup.


Approaches for dealing with split-brain

Davidson et al., after surveying several approaches to handle the problem, classify them as either optimistic or pessimistic. The optimistic approaches simply let the partitioned nodes work as usual; this provides a greater level of availability, at the cost of sacrificing correctness. Once the problem has ended, automatic or manual reconciliation might be required in order to have the cluster in a consistent state. One current implementation for this approach is
Hazelcast In computing, Hazelcast IMDG is an open source in-memory data grid based on Java. It is also the name of the company developing the product. The Hazelcast company is funded by venture capital and headquartered in Palo Alto, California. In a ...
, which does automatic reconciliation of its key-value store. The pessimistic approaches sacrifice availability in exchange for consistency. Once a network partitioning has been detected, access to the sub-partitions is limited in order to guarantee consistency. A typical approach, as described by Coulouris et al., is to use a quorum-consensus approach. This allows the sub-partition with a majority of the votes to remain available, while the remaining sub-partitions should fall down to an auto-
fencing Fencing is a group of three related combat sports. The three disciplines in modern fencing are the foil, the épée, and the sabre (also ''saber''); winning points are made through the weapon's contact with an opponent. A fourth discipline, ...
mode. One current implementation for this approach is the one used by
MongoDB MongoDB is a source-available cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with optional schemas. MongoDB is developed by MongoDB Inc. and licensed under the Ser ...
replica sets. And another such implementation is Galera replication for
MariaDB MariaDB is a community-developed, commercially supported fork of the MySQL relational database management system (RDBMS), intended to remain free and open-source software under the GNU General Public License. Development is led by some of the ori ...
and
MySQL MySQL () is an open-source relational database management system (RDBMS). Its name is a combination of "My", the name of co-founder Michael Widenius's daughter My, and "SQL", the acronym for Structured Query Language. A relational database ...
. Modern commercial general-purpose HA clusters typically use a combination of heartbeat network connections between cluster hosts, and quorum witness storage. The challenge with two-node clusters is that adding a witness device adds cost and complexity (even if implemented in the cloud), but without it, if heartbeat fails, cluster members cannot determine which should be active. In such clusters (without quorum), if a member fails, even if the members normally assign primary and secondary statuses to the hosts, there is at least a 50% probability that a 2-node HA cluster will totally fail until human intervention is provided, to prevent multiple members becoming active independently and either directly conflicting or corrupting data.


References

{{reflist, 30em High-availability cluster computing